In this paper we present a generic, language independent multi-document summarization system forming extracts using the cover coefficient concept. Cover Coefficient-based Summarizer (CCS) uses similarity between sentences to determine representative sentences. Experiments indicate that CCS is an efficient algorithm that is able to generate quality summaries online.
We compare the term- and document-centric static index pruning approaches as described in the literature and investigate their sensitivity to the scoring functions employed during the pruning and actual retrieval stages.
Traditional retrieval models based on term matching are not effective on collections of degraded documents (for instance, the output of OCR or ASR systems). This paper presents an n-gram based distributed model for retrieval on large collections of degraded text. Evaluation was carried out with both the TREC Confusion Track and Legal Track collections showing that the presented approach outperforms in...
The LETOR website contains three information retrieval datasets used as a benchmark for testing machine learning ideas for ranking. Participating algorithms are measured using standard IR ranking measures (NDCG, precision, MAP). Similarly to other participating algorithms, we train a linear classifier. In contrast, we define an additional free benchmark variable for each query. This allows expressing...
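As a reminder of how one of the ranking measures named above is computed, here is a minimal Python sketch of NDCG at cutoff k. It assumes the common exponential-gain, log2-discount formulation; the abstract does not say which variant the benchmark uses, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels.

    Assumes gain 2^rel - 1 and discount log2(rank + 1) with 1-based ranks.
    """
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)  # position is 0-based
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(relevances, k) / ideal
```

A ranking that already lists documents in decreasing order of relevance scores 1.0; pushing a relevant document down the list lowers the score.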
Overall query execution time consists of the time spent transferring data from disk to memory, and the time spent performing actual computation. In any measurement of overall time on a given hardware configuration, the two separate costs are aggregated. This makes it hard to reproduce results and to infer which of the two costs is actually affected by modifications proposed by researchers. In this...
When automatic plagiarism detection is carried out considering a reference corpus, a suspicious text is compared to a set of original documents in order to relate the plagiarised text fragments to their potential source. One of the biggest difficulties in this task is to locate plagiarised fragments that have been modified (by rewording, insertion or deletion, for example) from the source text. ...
In this paper, we study how to automatically exploit visual concepts in a text-based image retrieval task. First, we use Forest of Fuzzy Decision Trees (FFDTs) to automatically annotate images with visual concepts. Second, optionally using WordNet, we match visual concepts to the textual query. Finally, we filter the text-based image retrieval result list using the FFDTs. This study is performed in the...
This paper describes the use of MT metrics in choosing the best candidates for MT-based query translation resources. Our main metric is METEOR, but we also use NIST and BLEU. The language pair of our evaluation is English → German, because MT metrics still do not offer very many language pairs for comparison. We evaluated translations of CLEF 2003 topics of four different MT programs with MT metrics and...
We propose a new entropy-based algorithm for static index pruning. The algorithm computes an importance score for each document in the collection based on the entropy of each term. A threshold is set according to the desired level of pruning and all postings associated with documents that score below this threshold are removed from the index, i.e. documents are removed from the collection. We compare...
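The pruning procedure outlined in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' algorithm: the abstract does not give the entropy formula or how term entropies are combined into a document score, so the sketch assumes Shannon entropy of each term's frequency distribution over documents and a plain sum over the terms a document contains. The function names and the `prune_fraction` parameter are hypothetical.

```python
import math
from collections import defaultdict


def term_entropy(postings):
    """postings maps term -> {doc_id: term frequency}.

    Assumption: a term's entropy is the Shannon entropy of its
    frequency distribution across the documents that contain it.
    """
    entropy = {}
    for term, docs in postings.items():
        total = sum(docs.values())
        h = 0.0
        for tf in docs.values():
            p = tf / total
            h -= p * math.log2(p)
        entropy[term] = h
    return entropy


def document_scores(postings, entropy):
    # Assumption: a document's importance is the sum of the
    # entropies of the terms it contains.
    scores = defaultdict(float)
    for term, docs in postings.items():
        for doc_id in docs:
            scores[doc_id] += entropy[term]
    return dict(scores)


def prune_index(postings, prune_fraction):
    """Drop the lowest-scoring fraction of documents and their postings."""
    scores = document_scores(postings, term_entropy(postings))
    ranked = sorted(scores, key=lambda d: (scores[d], d))
    dropped = set(ranked[: int(len(ranked) * prune_fraction)])
    pruned = {}
    for term, docs in postings.items():
        kept = {d: tf for d, tf in docs.items() if d not in dropped}
        if kept:  # terms with no remaining postings disappear
            pruned[term] = kept
    return pruned, dropped
```

Choosing `prune_fraction` plays the role of the threshold in the abstract: it fixes how many of the lowest-scoring documents are removed from the index.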
This poster presents a novel way to represent user navigation in XML retrieval using collection statistics from XML summaries. Currently, developing user navigation models in XML retrieval is costly and the models are specific to collected user assessments. We address this problem by proposing summary navigation models which describe user navigation in terms of XML summaries. We develop our proposal...
In this paper, we address the problem of generating a query-specific extractive summary in an efficient manner for a given set of documents. In many of the current solutions, the entire collection of documents is modeled as a single graph which is used for summary generation. Unlike these approaches, in this paper, we model each individual document as a graph and generate a query-specific summary...
Opinion detection from blogs has always been a challenge for researchers. One of the challenges faced is to find documents that specifically contain opinion on users’ information need. This requires text processing at the sentence level rather than at the document level. In this paper, we propose an opinion detection approach. The proposed approach focuses on the above problem by processing documents...
Lexicon-based approaches have been widely used for opinion retrieval due to their simplicity. However, no previous work has focused on the domain-dependency problem in opinion lexicon construction. This paper proposes simple feedback-style learning of a query-specific opinion lexicon using the set of top-retrieved documents in response to a query. The proposed learning starts from the initial domain-independent...
This paper explores the use of implicit user feedback in adapting the underlying domain model of an intranet search system. The domain model, a Formal Concept Analysis (FCA) lattice, is used as an interactive interface to allow user exploration of the context of an intranet query. Implicit user feedback is harnessed here to surmount the difficulty of achieving optimum document descriptors, essential...
Evaluating a complex system is a complex task. Evaluation campaigns are organized each year to test different systems on global results, but they do not evaluate the relevance of the criteria used. Our purpose consists in modifying the intermediate results created by the components and inserting the new results into the process, without modifying the components. We will describe our framework of glass-box...
CBIR has been a challenging problem and its performance relies on the underlying image similarity (distance) metric. Most existing metrics evaluate pairwise image similarity based only on image content, which is denoted as content similarity. In this study we propose a novel similarity metric to make use of the image contexts in an image collection. The context of an image is built by constructing...
Errors in speech recognition transcripts have a negative impact on the effectiveness of content-based speech retrieval and present a particular challenge for collections containing conversational spoken content. We propose a Global Semantic Distortion (GSD) metric that measures the collection-wide impact of speech recognition error on spoken content retrieval in a query-independent manner. We deploy our...
We present a class of models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the task of interest, which we argue...
Segmenting videos into smaller, semantically related segments that ease access to the video data is a challenging open research problem. In this paper, we present a scheme for semantic story segmentation based on anchor person detection. The proposed model makes use of a split and merge mechanism to find story boundaries. The approach is based on visual features and text transcripts. The performance...
We propose a method by which supervised learning algorithms that only accept binary input can be extended to use ordinal (i.e., integer-valued) input. This is much needed in text classification, since it thus becomes possible to endow these learning devices with term frequency information, rather than just information on the presence/absence of the term in the document. We test two different...